Statistical and Semantic Feature Selection for Text Clustering

Authors

Asmaa Benghabrit

Brahim Ouhbi

Hicham Behja

Bouchra Frikh

Abstract

Organizing textual documents by categorizing them is important and beneficial for information retrieval, but clustering documents that contain a huge number of terms is challenging. Selecting effective features is therefore essential for reducing the dimensionality of the feature space and improving clustering performance. While numerous methods have been developed for this purpose, few consider the semantic knowledge that can be incorporated into the clustering process. This paper proposes, first, a new semantic feature selection method, SIM, based on the mutual information metric and, second, a novel two-phase clustering mechanism. The statistical feature selection method CHIR is integrated into the frequency clustering stage, and our technique SIM is then used in the second stage to guide the semantic categorization. The content-based analysis enhances the frequency clustering by taking the semantic relationships between features into account. The evaluation of our approach demonstrates its relevance in capturing statistically and semantically pertinent features that enable better clustering accuracy in terms of F-measure and purity.
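The abstract does not give the SIM formula, so the following is only a minimal sketch of the general idea of scoring semantic relatedness between features with a mutual-information-style measure. It assumes document-level co-occurrence and uses pointwise mutual information, PMI(a, b) = log(P(a, b) / (P(a) P(b))). The function name `term_pmi` and the toy documents are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def term_pmi(docs):
    """Pointwise mutual information for every pair of terms that
    co-occur in at least one document (document-level co-occurrence).

    Illustrative assumption only; this is not the paper's SIM method.
    """
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing both terms
    for doc in docs:
        terms = sorted(set(doc))  # sorted so each pair has one canonical order
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))

    pmi = {}
    for (a, b), n_ab in pair_df.items():
        # P(a, b) / (P(a) * P(b)) simplifies to n_ab * N / (n_a * n_b)
        pmi[(a, b)] = math.log((n_ab * n_docs) / (term_df[a] * term_df[b]))
    return pmi


# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["cluster", "feature", "text"],
    ["cluster", "feature"],
    ["semantic", "text"],
    ["semantic", "ontology"],
]
scores = term_pmi(docs)
```

Pairs with a high score co-occur more often than their individual frequencies would suggest; pairs that never co-occur are simply absent from the result. A feature selection step could, under these assumptions, keep the terms that are strongly related to the terms already characterizing a cluster.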

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Integrated Clustering and Feature Selection Scheme for Text Documents

Problem statement: Text documents are unstructured databases that contain raw data collections. Clustering techniques are used to group text documents according to their similarity. Approach: Feature selection techniques were used to improve the efficiency and accuracy of the clustering process. The feature selection was done by eliminating the redundant and irrelevant items from th...

Review on Text Clustering Using Statistical and Semantic Data

The explosive growth of information stored in unstructured texts has created a great demand for new and powerful tools to acquire useful information, such as text mining. Document clustering is one of its powerful methods, by which document retrieval, organization and summarization can be achieved. Text documents are unstructured databases that contain raw data collections. The clustering...

Ontology-based Concept Weighting for Text Documents

Document clustering has become an essential technology with the popularity of the Internet. This also means that fast, high-quality document clustering techniques play a core role. Text clustering, or simply clustering, is about discovering semantically related groups in an unstructured collection of documents. Clustering has been very popular for a long time because it provides unique ways of di...

Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines

In this paper, principles and existing feature selection methods for classifying and clustering data are introduced. To that end, categorizing frameworks for finding selected subsets, namely search-based and non-search-based procedures, as well as evaluation criteria and data mining tasks, are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...

Publication date: 2013
